Abstract

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We 
extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, 
characterized by its recall and its precision. In this framework, we provide an optimal algorithm to 
decide when to take predictions into account, and we derive the optimal value of the checkpointing 
period. These results make it possible to assess analytically the key parameters that impact the performance
of fault predictors at very large scale.

1 Introduction 

Nowadays, the most powerful High Performance Computing systems experience about one fault per 
day [2 [5] . Consider the relative slopes describing the evolution of the reliability of individual components 
on one side, and the evolution of the number of components on the other side: the reliability of an 
entire platform is expected to decrease, due to probabilistic amplification, as its number of components 
increases. Therefore, applications running on large computing systems have to cope with platform faults. 
There are two main approaches. On the one hand, applications can use fault-tolerance mechanisms such 
as checkpoint and rollback in order to become resilient. On the other hand, system administrators can 
try to predict where and when faults will strike. Although considerable research has been devoted to 
fault predictors [2 El El El [3 [S] , no predictor will ever be able to predict every fault. Therefore, fault 
predictors will have to be used in conjunction with fault-tolerance mechanisms. 

In this paper, we assess the impact of fault prediction techniques on checkpointing strategies. We
consider jobs executing on a platform subject to faults, and we let $\mu$ be the mean time between
faults (MTBF) of the platform. In the absence of fault prediction, the standard approach is to take
periodic checkpoints, each of length $C$, every period of duration $T$. In steady-state utilization of the
platform, the value $T_{opt}$ of $T$ that minimizes the expected waste of resource usage due to checkpointing is
approximated as $T_{opt} = \sqrt{2\mu C} + C$, or $T_{opt} = \sqrt{2(\mu + R)C} + C$ (where $R$ is the duration of the recovery).
The former expression is the well-known Young's formula [9], while the latter is due to Daly [10].

Now, when some fault prediction mechanism is available, can we compute a better checkpointing
period to decrease the expected waste, and to what extent? Critical parameters that characterize a
fault prediction system are its recall r, which is the fraction of faults that are indeed predicted, and its 
precision p, which is the fraction of predictions that are correct (i.e., correspond to actual faults). The 
major objective of this paper is to refine the expression of the expected waste as a function of these new 
parameters, and to design efficient checkpointing policies that take predictions into account. The key 
contributions of this paper are: 

• A refined first-order analysis in the absence of fault prediction. It leads to similar performance to 
Young [9] and Daly [10] when faults follow an Exponential distribution, and to better performance 
when faults follow a Weibull distribution. 

• The extension of this analysis to fault predictions, and the design of a new checkpointing policy 
that takes optimal decisions on whether to take these predictions into account or to ignore them. 



• An extensive set of simulations that corroborates all mathematical derivations, both for Exponential 
fault distributions, and for (more realistic) Weibull fault distributions.

The rest of the paper is organized as follows. We first detail the framework in Section 2. We revisit
Young and Daly's approach in Section 3. We provide an optimal algorithm to account for predictions in
Section 4. Section 5 is devoted to simulations. We discuss related work in Section 6. Finally, we provide
concluding remarks in Section 7.



2 Framework 



2.1 Checkpointing strategy 

We consider a platform subject to faults. Our work is agnostic of the granularity of the platform, 
which may consist either of a single processor, or of several processors that work concurrently and use 
coordinated checkpointing. Checkpoints are taken at regular intervals, or periods, of length T. We 
denote by C the duration of a checkpoint (all checkpoints have the same duration). By construction, we
must enforce that C < T. When a fault strikes the platform, the application is lacking some resource for
a certain period of time of length D, the downtime. The downtime accounts for software rejuvenation
(i.e., rebooting [12]) or for the replacement of the failed hardware component by a spare one. Then,
the application recovers from the last checkpoint. R denotes the duration of this recovery time. 



2.2 Fault predictor 

A fault predictor is a mechanism that is able to predict that some faults will take place, either at a certain 
point in time, or within some time-interval window. In this paper, we assume that the predictor is able 
to provide exact prediction dates, and to generate such predictions early enough so that a proactive 
checkpoint can indeed be taken before the event. 

The accuracy of the fault predictor is characterized by two quantities, the recall and the precision. 
The recall r is the fraction of faults that are predicted while the precision p is the fraction of fault 
predictions that are correct. Traditionally, one defines three types of events: (i) True positive events are
faults that the predictor has been able to predict (let $True_P$ be their number); (ii) False positive events
are fault predictions that did not materialize as actual faults (let $False_P$ be their number); and (iii) False
negative events are faults that were not predicted (let $False_N$ be their number). With these definitions, we
have $r = \frac{True_P}{True_P + False_N}$ and $p = \frac{True_P}{True_P + False_P}$.

Proactive checkpoints may have a different length Cp than regular checkpoints of length C. In fact 
there are many scenarios. On the one hand, we may well have Cp > C in scenarios where regular checkpoints
are taken at time-steps where the application memory footprint is minimal [13]; on the contrary,
proactive checkpoints are taken according to predictions that can take place at arbitrary instants. On 
the other hand, we may have Cp < C in other scenarios [5], e.g., when the prediction is localized to a 
particular resource subset, hence allowing for a smaller volume of checkpointed data. 

To keep full generality, we deal with two checkpoint sizes in this paper: C for periodic checkpoints, 
and Cp for proactive checkpoints (those taken upon predictions). 

In the literature, the lead time is the interval between the date at which the prediction is made 
available, and the actual prediction date. However, we point out that the distribution of these lead times 
is irrelevant to the problem: either a fault is predicted at least Cp seconds in advance, and then one 
can checkpoint just in time before the fault, or the prediction is useless! In other words, predictions 
that come too late should be classified as unpredicted faults whenever they materialize as actual faults, 
leading to a smaller value of the predictor recall. 



2.3 Fault rates 

The key parameter is $\mu$, the mean time between faults (MTBF) of the platform. If the platform is made
of $N$ components whose individual MTBF is $\mu_{ind}$, then $\mu = \frac{\mu_{ind}}{N}$. This result is true regardless of the
fault distribution law.¹

¹For the sake of completeness, we provide a proof of this widely-used result in Appendix A. To the best of our knowledge, no proof
has been published in the literature yet.
In addition to $\mu$, the platform MTBF, let $\mu_P$ be the mean time between predicted events (both true
positive and false positive), and let $\mu_{NP}$ be the mean time between unpredicted faults (false negative).
Finally, we define the mean time between events as $\mu_e$ (including all three event types). The relationships
between $\mu$, $\mu_P$, $\mu_{NP}$, and $\mu_e$ are the following:

• Rate of unpredicted faults: $\frac{1}{\mu_{NP}} = \frac{1 - r}{\mu}$, since $1 - r$ is the fraction of faults that are unpredicted;

• Rate of predicted faults: $\frac{r}{\mu} = \frac{p}{\mu_P}$, since $r$ is the fraction of faults that are predicted, and $p$ is the
fraction of fault predictions that are correct;

• Rate of events: $\frac{1}{\mu_e} = \frac{1}{\mu_P} + \frac{1}{\mu_{NP}}$, since events are either predictions (true or false), or unpredicted
faults.

2.4 Objective: waste minimization 

The natural objective is to minimize the expectation of the total execution time, makespan, of the 
application. Instead, in order to ease mathematical derivations, we aim at minimizing the waste. The 
waste is the expected percentage of time lost, or "wasted", during the execution. In other words, the 
waste is the fraction of time during which the platform is not doing useful work. This definition was 
introduced by Wingstrom [l^. Obviously, the lower the waste, the lower the expected makespan, and 
reciprocally. Hence the two objectives are strongly related and minimizing one of them also minimizes 
the other. 



3 Revisiting Daly's first-order approximation 

Young proposed in [9] a "first order approximation to the optimum checkpoint interval". Young's formula
was later refined by Daly [10] to take into account the recovery time. We revisit their analysis using the
notion of waste.

Let $TIME_{base}$ be the base time of the application without any overhead (neither checkpoints nor
faults). First, assume a fault-free execution of the application with periodic checkpointing. In such an
environment, during each period of length $T$ we take a checkpoint, which lasts for a time $C$, and only
$T - C$ units of work are executed. Let $TIME_{FF}$ be the execution time of the application in this setting.
Following most works in the literature, we also take a checkpoint at the end of the execution. The
fault-free execution time $TIME_{FF}$ is equal to the time needed to execute the whole application, $TIME_{base}$,
plus the time taken by the checkpoints:

$TIME_{FF} = TIME_{base} + N_{ckpt} C$ (1)

where $N_{ckpt}$ is the number of checkpoints taken. We have

$N_{ckpt} = \left\lceil \frac{TIME_{base}}{T - C} \right\rceil \approx \frac{TIME_{base}}{T - C}$

When discarding the ceiling function, we assume that the execution time is very large with respect to
the period or, symmetrically, that there are many periods during the execution. Plugging back the
(approximated) value $N_{ckpt} = \frac{TIME_{base}}{T - C}$, we derive that

$TIME_{FF} = \frac{TIME_{base}}{T - C} T$ (2)

The waste due to checkpointing in a fault-free execution, $WASTE_{FF}$, is defined as the fraction of the
execution time that does not contribute to the progress of the application:

$WASTE_{FF} = \frac{TIME_{FF} - TIME_{base}}{TIME_{FF}} \Leftrightarrow (1 - WASTE_{FF}) TIME_{FF} = TIME_{base}$ (3)

Combining Equations (2) and (3), we get:

$WASTE_{FF} = \frac{C}{T}$ (4)
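
To see how little is lost by discarding the ceiling function, the sketch below computes the fault-free waste with the exact number of checkpoints and compares it with $C/T$. The function names and the numerical values are illustrative assumptions.

```python
import math


def fault_free_time(time_base, T, C):
    """Fault-free execution time with the exact (ceiling) number of checkpoints."""
    n_ckpt = math.ceil(time_base / (T - C))
    return time_base + n_ckpt * C


def fault_free_waste(time_base, T, C):
    """Fraction of the fault-free execution time spent checkpointing."""
    time_ff = fault_free_time(time_base, T, C)
    return (time_ff - time_base) / time_ff


# Hypothetical job: 10^6 s of work, period of 3600 s, checkpoints of 600 s.
exact = fault_free_waste(1.0e6, 3600.0, 600.0)
approx = 600.0 / 3600.0  # Equation (4): C / T
print(exact, approx)
```

The two values agree exactly when $TIME_{base}$ is a multiple of $T - C$, and the gap shrinks as the number of periods grows.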
Now, let $TIME_{final}$ denote the expected execution time of the application in the presence of faults.
This execution time can be divided into two parts: (i) the execution of "chunks" of work of size $T - C$
followed by their checkpoint; and (ii) the time lost due to the faults. This decomposition is illustrated
by Figure 1. The first part of the execution time is equal to $TIME_{FF}$. Let $N_{faults}$ be the number of faults
occurring during the execution, and let $T_{lost}$ be the average time lost per fault. Then,

$TIME_{final} = TIME_{FF} + N_{faults} \times T_{lost}$ (5)

On average, during a time $TIME_{final}$, $N_{faults} = \frac{TIME_{final}}{\mu}$ faults happen. We need to estimate $T_{lost}$. The
instants at which periods begin and at which faults strike are independent. Therefore, the expected
time elapsed between the completion of the last checkpoint and a fault is $\frac{T}{2}$ for all distribution laws,
regardless of their particular shape. We conclude that $T_{lost} = \frac{T}{2} + D + R$, because after each fault there
is a downtime and a recovery. This leads to:

$TIME_{final} = TIME_{FF} + \frac{TIME_{final}}{\mu} \times \left(D + R + \frac{T}{2}\right)$

Let $WASTE_{fault}$ be the fraction of the total execution time that is lost because of faults:

$WASTE_{fault} = \frac{TIME_{final} - TIME_{FF}}{TIME_{final}} \Leftrightarrow (1 - WASTE_{fault}) TIME_{final} = TIME_{FF}$ (6)

We derive:

$WASTE_{fault} = \frac{1}{\mu} \left(D + R + \frac{T}{2}\right)$ (7)



[Figure 1 image: two timelines made of chunks of T - C units of work, each followed by a checkpoint of length C, annotated with TIME_FF = TIME_final (1 - WASTE_fault) and TIME_final.]

Figure 1: An execution (top), and its re-ordering (bottom), to illustrate both sources of waste. Blackened
intervals correspond to work destroyed by faults, downtimes, and recoveries.

In [10], Daly uses the expression

$TIME_{final} = (1 + WASTE_{fault}) TIME_{FF}$ (8)

instead of Equation (6), which leads him to his well-known first-order formula

$T = \sqrt{2(\mu + (D + R))C} + C$ (9)

Figure 1 explains why Equation (8) is not correct and should be replaced by Equation (6). Indeed,
the expected number of faults depends on the final time, not on the time for a fault-free execution.
We point out that Young [9] also used Equation (8), but with $D = R = 0$. Equation (6) can be
rewritten $TIME_{final} = TIME_{FF} / (1 - WASTE_{fault})$. Therefore, using Equation (8) instead of Equation (6)
is, in fact, equivalent to writing $\frac{1}{1 - WASTE_{fault}} \approx 1 + WASTE_{fault}$, which is indeed a first-order approximation
if $WASTE_{fault} \ll 1$.

Now, let $WASTE$ denote the total waste:

$WASTE = \frac{TIME_{final} - TIME_{base}}{TIME_{final}}$ (10)

Therefore

$WASTE = 1 - \frac{TIME_{base}}{TIME_{final}} = 1 - \frac{TIME_{base}}{TIME_{FF}} \frac{TIME_{FF}}{TIME_{final}} = 1 - (1 - WASTE_{FF})(1 - WASTE_{fault})$.
Altogether, we derive the final result:

$WASTE = WASTE_{FF} + WASTE_{fault} - WASTE_{FF} WASTE_{fault}$ (11)

$WASTE = \frac{C}{T} + \left(1 - \frac{C}{T}\right) \frac{1}{\mu} \left(D + R + \frac{T}{2}\right)$ (12)

We obtain $WASTE = \frac{u}{T} + v + wT$, where $u = C\left(1 - \frac{D + R}{\mu}\right)$, $v = \frac{D + R - C/2}{\mu}$, and $w = \frac{1}{2\mu}$. Thus $WASTE$
is minimized for $T = \sqrt{u/w}$. The Refined First-Order (RFO) formula for the optimal period is thus:

$T_{RFO} = \sqrt{2(\mu - (D + R))C}$ (13)
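
The three first-order periods can be compared numerically. The sketch below implements Young's formula, Daly's formula (9), the RFO formula (13) and the waste of Equation (12); the function names are illustrative assumptions, and the sample values ($C = R = 600$ s, $D = 60$ s, $\mu = 7519$ s) are those of the last row of Table 1.

```python
import math


def young_period(mu, C):
    """Young's first-order period: sqrt(2 mu C) + C."""
    return math.sqrt(2.0 * mu * C) + C


def daly_period(mu, C, D, R):
    """Daly's first-order period, Equation (9)."""
    return math.sqrt(2.0 * (mu + D + R) * C) + C


def rfo_period(mu, C, D, R):
    """Refined First-Order period, Equation (13)."""
    return math.sqrt(2.0 * (mu - (D + R)) * C)


def waste(T, mu, C, D, R):
    """Total first-order waste, Equation (12)."""
    return C / T + (1.0 - C / T) * (D + R + T / 2.0) / mu


mu, C, D, R = 7519.0, 600.0, 60.0, 600.0
for name, T in [("Young", young_period(mu, C)),
                ("Daly", daly_period(mu, C, D, R)),
                ("RFO", rfo_period(mu, C, D, R))]:
    print(name, round(T), waste(T, mu, C, D, R))
```

By construction, the RFO period minimizes the first-order waste model of Equation (12); whether it also gives shorter executions in practice is what the simulations are meant to assess.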



It is interesting to point out why Equation (13) is a first-order approximation, even for large jobs.
Indeed, there are several restrictions to enforce for the approach to be valid:

• We have stated that the expected number of faults during execution is $N_{faults} = \frac{TIME_{final}}{\mu}$, and that
the expected time lost due to a fault is $T_{lost} = \frac{T}{2} + D + R$. Both statements are true individually, but
the expectation of a product is the product of the expectations only if the random variables are
independent, which is not the case here because $TIME_{final}$ depends upon the failure inter-arrival
times.

• In Equation (4), we have to enforce $C < T$ to have $WASTE_{FF} < 1$.

• In Equation (7), we have to enforce $D + R \le \mu$ and to bound $T$ in order to have $WASTE_{fault} \le 1$.
Intuitively, we need $\mu$ to be large enough for Equation (7) to make sense. However, regardless of
the value of the individual MTBF $\mu_{ind}$, there is always a threshold in the number of components
$N$ above which the platform MTBF $\mu = \frac{\mu_{ind}}{N}$ becomes too small for Equation (7) to be valid.

• Equation (7) is accurate only when two or more faults do not take place within the same period.
Although unlikely when $\mu$ is large in front of $T$, the possible occurrence of many faults during the
same period cannot be eliminated.

To ensure that the latter condition (at most a single fault per period) is met with a high probability,
we cap the length of the period: we enforce the condition $T \le \alpha\mu$, where $\alpha$ is some tuning parameter
chosen as follows. The number of faults during a period of length $T$ can be modeled as a Poisson
process of parameter $\beta = \frac{T}{\mu}$. The probability of having $k \ge 0$ faults is $P(X = k) = \frac{\beta^k}{k!} e^{-\beta}$, where
$X$ is the number of faults. Hence the probability of having two or more faults is $\pi = P(X \ge 2) =
1 - (P(X = 0) + P(X = 1)) = 1 - (1 + \beta)e^{-\beta}$. If we assume $\alpha = 0.27$ then $\pi \le 0.03$ up to rounding, hence a valid
approximation when bounding the period range accordingly. Indeed, with such a conservative value for
$\alpha$, we have overlapping faults for only 3% of the checkpointing segments on average, so that the model
is quite reliable. For consistency, we also enforce the same type of bound on the checkpoint time, and
on the downtime and recovery: $C \le \alpha\mu$ and $D + R \le \alpha\mu$. However, enforcing these constraints may
lead to using a sub-optimal period: it may well be the case that the optimal period $\sqrt{2(\mu - (D + R))C}$ of
Equation (13) does not belong to the admissible interval $[C, \alpha\mu]$. In that case, the waste is minimized
for one of the bounds of the admissible interval: this is because, as seen from Equation (12), the waste
is a convex function of the period.
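
The capping rule can be sketched in a few lines of Python. The function names and the default `alpha=0.27` argument are illustrative assumptions; the code only restates the Poisson estimate of two or more faults per period and the projection of the RFO period onto the admissible interval $[C, \alpha\mu]$.

```python
import math


def prob_two_or_more_faults(T, mu):
    """P(X >= 2) for a Poisson number of faults with parameter beta = T / mu."""
    beta = T / mu
    return 1.0 - (1.0 + beta) * math.exp(-beta)


def capped_rfo_period(mu, C, D, R, alpha=0.27):
    """RFO period of Equation (13), clamped to the admissible interval [C, alpha * mu].

    Assumes C <= alpha * mu and D + R <= alpha * mu, as required in the text.
    """
    t_rfo = math.sqrt(2.0 * (mu - (D + R)) * C)
    return max(C, min(t_rfo, alpha * mu))


# At the cap T = 0.27 * mu, about 3% of the periods see two or more faults.
print(prob_two_or_more_faults(0.27 * 1000.0, 1000.0))
# With mu = 7519 s the cap is active; with mu = 3849609 s it is not.
print(capped_rfo_period(7519.0, 600.0, 60.0, 600.0))
print(capped_rfo_period(3849609.0, 600.0, 60.0, 600.0))
```

Because the waste is convex in $T$, clamping the unconstrained minimizer to the interval yields the constrained minimizer.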

We conclude this discussion on a positive note. While capping the period, and enforcing a lower bound
on the MTBF, is mandatory for mathematical rigor, simulations (see Section 5 for both Exponential and
Weibull distributions) show that actual job executions can always use the value from Equation (13),
accounting for multiple faults whenever they occur by re-executing the work until success. The first-order
model turns out to be surprisingly robust!

To the best of our knowledge, despite all the limitations above, there is no better approach to
estimate the waste due to checkpointing when dealing with arbitrary fault distributions. However,
assuming that faults obey an Exponential distribution, it is possible to use the memory-less property of
this distribution to provide more accurate results. A second-order approximation when faults obey an
Exponential distribution is given in Daly [10, Equation (20)] as $TIME_{final} = \mu e^{R/\mu} (e^{T/\mu} - 1) \frac{TIME_{base}}{T - C}$. In fact,
in that case, the exact value of $TIME_{final}$ is provided in [HIITB] as $TIME_{final} = (\mu + D) e^{R/\mu} (e^{T/\mu} - 1) \frac{TIME_{base}}{T - C}$,
and the optimal period is then $T = C + \mu \left(1 + L(-e^{-\frac{C}{\mu} - 1})\right)$, where $L$, the Lambert function, is defined as $L(z)e^{L(z)} = z$.
To assess the accuracy of the different first-order approximations, we compare the periods defined
by Young's formula [9], Daly's formula [10], and Equation (13), to the optimal period, in the case of an
Exponential distribution. Results are reported in Table 1. To establish these results, we use the same
parameters as in Section 5: $C = R = 600$ s, $D = 60$ s, and $\mu_{ind} = 125$ years. Furthermore, to compute
the optimal period, for each platform size we choose the application size so that $TIME_{base} = 2$ hours.
One can observe in Table 1 that the relative error for Daly's period is slightly larger than the one for
Young's period. In turn, the absolute value of the relative error for Young's period is slightly larger than
the one for RFO. More importantly, when Young's and Daly's formulas overestimate the period, RFO
underestimates it. Table 1 does not allow us to assess whether these differences are actually significant.
However, we also report in Section 5.2 some simulations that show that Equation (13) leads to smaller
execution times for Weibull distributions than both classical formulas (Tables 3 and 4).



N      μ         Young            Daly             RFO               Optimal
2^10   3849609   68567 (0.5 %)    68573 (0.5 %)    67961 (-0.4 %)    68240
2^11   1924805   48660 (0.7 %)    48668 (0.7 %)    48052 (-0.6 %)    48320
2^12   962402    34584 (1.2 %)    34595 (1.2 %)    33972 (-0.6 %)    34189
2^13   481201    24630 (1.6 %)    24646 (1.7 %)    24014 (-0.9 %)    24231
2^14   240601    17592 (2.3 %)    17615 (2.5 %)    16968 (-1.3 %)    17194
2^15   120300    12615 (3.2 %)    12648 (3.5 %)    11982 (-1.9 %)    12218
2^16   60150     9096 (4.5 %)     9142 (5.1 %)     8449 (-2.9 %)     8701
2^17   30075     6608 (6.3 %)     6673 (7.4 %)     5941 (-4.4 %)     6214
2^18   15038     4848 (8.8 %)     4940 (10.8 %)    4154 (-6.8 %)     4458
2^19   7519      3604 (12.0 %)    3733 (16.0 %)    2869 (-10.8 %)    3218

Table 1: Comparing periods produced by the different approximations with the optimal value. Beside each
period, we report its relative deviation from the optimal. Each value is expressed in seconds.
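
For Exponential faults, the optimal period can also be obtained numerically. The sketch below assumes, as in the expression quoted from the literature above, that the expected execution time is proportional to $(e^{T/\mu} - 1)/(T - C)$, so that $D$ and $R$ do not influence the minimizer. Setting the derivative to zero gives the condition $\frac{T - C}{\mu} - 1 + e^{-T/\mu} = 0$, which is solved here by bisection. The function name and tolerance are illustrative assumptions, and the derivation is our own reading rather than the authors' code.

```python
import math


def exact_optimal_period(mu, C, tol=1e-9):
    """Minimize (exp(T/mu) - 1) / (T - C) over T > C by bisection.

    g below is the numerator of the derivative (up to a positive factor); it is
    increasing for T > 0, negative at T = C and positive at T = C + mu.
    """
    def g(T):
        return (T - C) / mu - 1.0 + math.exp(-T / mu)

    lo, hi = C, C + mu
    while hi - lo > tol * mu:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Two smallest MTBF values of Table 1, with C = 600 s.
print(exact_optimal_period(7519.0, 600.0))
print(exact_optimal_period(15038.0, 600.0))
```

For these two rows the result lands within a second of the "Optimal" column of Table 1; we have not checked every row, and rows with very large periods may depend on how the application size was chosen.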



4 Taking predictions into account

In this section, we present an analytical model to assess the impact of predictions on periodic checkpointing
strategies. As already mentioned, we consider the case where the predictor is able to provide exact
prediction dates, and to generate such predictions at least $C_p$ seconds in advance, so that a proactive
checkpoint of length $C_p$ can indeed be taken before the event.

For the sake of clarity, we start with a simple algorithm (Section 4.1), which we refine in Section 4.2.
We then compute the value of the period that minimizes the waste in Section 4.3.



4.1 Simple algorithm 

In this section, we consider the following algorithm: 

• While no fault prediction is available, checkpoints are taken periodically with period $T$;

• When a fault is predicted, there are two cases: either there is the possibility to take a proactive
checkpoint, or there is not enough time to do so, because we are already checkpointing (see
Figures 2(b) and 2(c)). In the latter case, there is no other choice than ignoring the prediction. In
the former case, we still have the possibility to ignore the prediction, but we may also decide to
trust it: in fact the decision is randomly taken. With probability $q$, we trust the predictor and
take the prediction into account (see Figures 2(f) and 2(g)), and, with probability $1 - q$, we ignore
the prediction (see Figures 2(d) and 2(e));

• If we take the prediction into account, we take a proactive checkpoint (of length $C_p$) as late as
possible, i.e., so that it completes right at the time when the fault is predicted to happen. After
this checkpoint, we complete the execution of the period (see Figures 2(f) and 2(g));

• If we ignore the prediction, either by necessity (not enough time to take an extra checkpoint, see
Figures 2(b) and 2(c)), or by choice (with probability $1 - q$, Figures 2(d) and 2(e)), we finish
the current period and start a new one.
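
The decision rule of this simple algorithm can be sketched as follows. The function name, the string labels, and the convention that `t` is the time elapsed since the beginning of the current period at the predicted fault date are assumptions made for illustration.

```python
import random


def handle_prediction(t, C_p, q, rng=random):
    """Decide what to do with a fault predicted at offset t in the current period.

    Returns "ignore" or "proactive_checkpoint". The prediction is ignored by
    necessity when a checkpoint of length C_p cannot complete before the
    predicted date (t < C_p); otherwise it is trusted with probability q.
    """
    if t < C_p:
        return "ignore"  # not enough time for a proactive checkpoint
    if rng.random() < q:
        return "proactive_checkpoint"  # checkpoint completes right at the predicted date
    return "ignore"  # ignored by choice, with probability 1 - q


# Hypothetical values: proactive checkpoints of 600 s, trust probability 0.5.
print(handle_prediction(t=100.0, C_p=600.0, q=0.5))
print(handle_prediction(t=2000.0, C_p=600.0, q=0.5))
```

Note that the decision does not depend on `t` once `t >= C_p`; removing this limitation is precisely the purpose of the refined algorithm.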
[Figure 2 image: seven timelines of periods made of T - C units of work followed by a checkpoint. Panels: (a) Unpredicted fault; (b) Prediction cannot be taken into account - no actual fault; (c) Prediction cannot be taken into account - with actual fault; (d) Prediction not taken into account by choice - no actual fault; (e) Prediction not taken into account by choice - with actual fault; (f) Prediction taken into account - no actual fault; (g) Prediction taken into account - with actual fault.]

Figure 2: Actions taken for the different event types.



The rationale for not always trusting the predictor is to avoid taking useless checkpoints too frequently. 
Intuitively, the precision p of the predictor must be above a given threshold for its usage to be worthwhile. 
In other words, if we decide to checkpoint just before a predicted event, either we will save time by 
avoiding a costly re-execution if the event does correspond to an actual fault, or we will lose time by 
unduly performing an extra checkpoint. We need a larger proportion of the former cases, i.e., a good 
precision, for the predictor to be really useful. The following analysis will determine the optimal value
of $q$ as a function of the parameters $C$, $C_p$, $\mu$, $r$, and $p$.

We could refine the approach by taking into account the amount of work already done in the current
period when deciding whether to trust the predictor or not. Intuitively, the more work already done, the
more important it is to save it, hence the more worthwhile it is to trust the predictor. We design such a refined
strategy in Section 4.2. For now, we analyze a simpler algorithm where we decide to trust or not to
trust the predictor, independently of the amount of work done so far within the period.

We analyze the algorithm in order to compute a formula for the expected waste, just as in Equation (12).
While the value of $WASTE_{FF}$ is unchanged ($WASTE_{FF} = \frac{C}{T}$), the value of $WASTE_{fault}$ is
modified because of predictions. As illustrated in Figure 2, there are many different scenarios that
contribute to $WASTE_{fault}$ that can be sorted into three categories:

(1) Unpredicted faults: This overhead occurs each time an unpredicted fault strikes, that is, on average,
once every $\mu_{NP}$ seconds. Just as in Equation (7), the corresponding waste is $\frac{1}{\mu_{NP}} \left[\frac{T}{2} + D + R\right]$.

(2) Predictions not taken into account: The second source of waste is for predictions that are
ignored. This overhead occurs in two different scenarios. First, if we do not have time to take a proactive
checkpoint, we have an overhead if and only if the prediction is an actual fault. This case happens with
probability $p$. We then lose a time $t + D + R$ if the predicted fault happens a time $t$ after the completion
of the last periodic checkpoint. The expected time lost is thus

$T^1_{lost} = \frac{1}{T} \int_0^{C_p} \left(p(t + D + R) + (1 - p) 0\right) dt$

Then, if we do have time to take a proactive checkpoint but still decide to ignore the prediction, we also
have an overhead if and only if the prediction is an actual fault, but the expected time lost is now weighted
by the probability $(1 - q)$:

$T^2_{lost} = (1 - q) \frac{1}{T} \int_{C_p}^{T} \left(p(t + D + R) + (1 - p) 0\right) dt$

(3) Predictions taken into account: We now compute the overhead due to a prediction which we
trust (hence we checkpoint just before its date). If the prediction is an actual fault, we lose $C_p + D + R$
seconds, but if it is not, we lose the unnecessary extra checkpoint time $C_p$. The expected time lost is
now weighted by the probability $q$ and becomes

$T^3_{lost} = q \frac{1}{T} \int_{C_p}^{T} \left(p(C_p + D + R) + (1 - p) C_p\right) dt$

We derive the final value of $WASTE_{fault}$:

$WASTE_{fault} = \frac{1}{\mu_{NP}} \left[\frac{T}{2} + D + R\right] + \frac{1}{\mu_P} \left[T^1_{lost} + T^2_{lost} + T^3_{lost}\right]$

This final expression comes from the disjunction of all possible cases, using the Law of Total Probability
[17, p. 23]: the waste comes either from non-predicted faults or from predictions; in the latter case, we
have analyzed the three possible sub-cases and weighted them with their respective probabilities. After
simplifications, we obtain

$WASTE_{fault} = \frac{1}{\mu} \left( (1 - rq) \frac{T}{2} + D + R + \frac{qr}{p} C_p - \frac{qr C_p^2}{pT} (1 - p/2) \right)$ (14)
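
The simplification leading to Equation (14) can be checked numerically. The sketch below evaluates the closed form of Equation (14) and, separately, the sum of the unpredicted-fault term and the three prediction terms, with each integral replaced by its antiderivative. Function names and parameter values (in particular $r = 0.85$, $p = 0.82$, $q = 0.7$) are hypothetical and chosen for illustration only.

```python
def waste_fault_closed_form(T, mu, r, p, q, C_p, D, R):
    """Equation (14): waste due to faults for the simple (randomized) algorithm."""
    return ((1 - r * q) * T / 2 + D + R
            + q * r / p * C_p
            - q * r * C_p ** 2 / (p * T) * (1 - p / 2)) / mu


def waste_fault_by_cases(T, mu, r, p, q, C_p, D, R):
    """Same quantity, assembled from the three categories of events (requires r < 1)."""
    mu_np = mu / (1 - r)   # mean time between unpredicted faults
    mu_p = p * mu / r      # mean time between predictions
    # Prediction too early in the period: no time for a proactive checkpoint.
    t1 = p * (C_p ** 2 / 2 + (D + R) * C_p) / T
    # Prediction ignored by choice, with probability 1 - q.
    t2 = (1 - q) * p * ((T ** 2 - C_p ** 2) / 2 + (D + R) * (T - C_p)) / T
    # Prediction trusted, with probability q.
    t3 = q * (p * (C_p + D + R) + (1 - p) * C_p) * (T - C_p) / T
    return (T / 2 + D + R) / mu_np + (t1 + t2 + t3) / mu_p


args = dict(T=3000.0, mu=60150.0, r=0.85, p=0.82, q=0.7, C_p=600.0, D=60.0, R=600.0)
print(waste_fault_closed_form(**args), waste_fault_by_cases(**args))
```

Setting `q = 0` in either function recovers Equation (7), as expected when predictions are never used.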




We could now plug this expression back into Equation (11) to compute the value of $T$ that minimizes
the total waste. Instead, we move on to describing the refined algorithm, and we minimize the waste for
the refined strategy, since it always induces a smaller waste.



4.2 Refined algorithm 

In this section, we refine the approach and consider different trust strategies, depending upon the time
in the period where the prediction takes place. Intuitively, the later in the period, the more
inclined we are to trust the predictor, because the amount of work that we could lose gets larger and larger.
As before, we cannot take into account a fault predicted to happen less than $C_p$ units of time after the
beginning of the period. Therefore, we focus on what happens in the period after time $C_p$. Formally,
we now divide the interval $[C_p, T]$ into $n$ intervals $[\beta_i, \beta_{i+1}]$ for $i \in \{0, \dots, n - 1\}$, where $\beta_0 = C_p$ and
$\beta_n = T$. For each interval $[\beta_i, \beta_{i+1}]$, we trust the predictor with probability $q_i$. We aim at determining
the values of $n$, $\beta_i$, and $q_i$ that minimize the waste. As mentioned before, intuition tells us that the
$q_i$ values should be non-decreasing. We prove below a somewhat unexpected theorem: in the optimal
strategy, there is either one or two different $q_i$ values, and these values are 0 or 1. This means that
we should never trust the predictor in the beginning of a period, and always trust it in the end of the
period, without any intermediate behavior in between.

We formally express this striking result below. Let $\beta_{lim} = \frac{C_p}{p}$. The optimal strategy is provided by
Theorem 1 below. We first prove the following proposition:
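Proposition 1. The values of $\beta_i$ and $q_i$ that minimize the waste satisfy the following conditions:
(i) For all $i$ such that $\beta_{i+1} < \beta_{lim}$, $q_i = 0$.
(ii) For all $i$ such that $\beta_i > \beta_{lim}$, $q_i = 1$.

Proof. First we compute the waste with the refined algorithm, using Equation (11). The formula for
$WASTE_{fault}$ is similar to Equation (14) on each interval:

$WASTE = \frac{C}{T} + \left(1 - \frac{C}{T}\right) \left[ \frac{1}{\mu_{NP}} \left(\frac{T}{2} + D + R\right) + \frac{1}{\mu_P} \left( \frac{1}{T} \int_0^{C_p} p(t + D + R) dt + \sum_{i=0}^{n-1} \frac{1}{T} \int_{\beta_i}^{\beta_{i+1}} \left( (1 - q_i) p(t + D + R) + q_i \left(p(C_p + D + R) + (1 - p) C_p\right) \right) dt \right) \right]$

Now, consider a fixed value of $i$ and express the value of $WASTE$ as a function of $q_i$:

$WASTE = K + \left(1 - \frac{C}{T}\right) \frac{q_i}{\mu_P} \int_{\beta_i}^{\beta_{i+1}} \left( \frac{C_p}{T} - \frac{pt}{T} \right) dt$

where $K$ does not depend on $q_i$. From the sign of the function to be integrated, one sees that $WASTE$ is
minimized when $q_i = 0$ if $\beta_{i+1} < \beta_{lim} = \frac{C_p}{p}$ and when $q_i = 1$ if $\beta_i > \beta_{lim}$. □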

Theorem 1. The optimal algorithm takes proactive actions if and only if the prediction falls in the
interval $[\beta_{lim}, T]$.

Proof. From Proposition 1, the values for $q_i$ are optimally defined for every $i$ but one: we do not know
the optimal value if there exists $i_0$ such that $\beta_{i_0} < \beta_{lim} < \beta_{i_0+1}$. Then let us consider the waste where
$q_{i_0}$ is replaced by $q^{(1)}_{i_0}$ on $[\beta_{i_0}, \beta_{lim}]$ and by $q^{(2)}_{i_0}$ on $[\beta_{lim}, \beta_{i_0+1}]$. The new waste is necessarily no larger than
the one with only $q_{i_0}$, since we relaxed the constraint. We know from Proposition 1 that the optimal
solution is then to have $q^{(1)}_{i_0} = 0$ and $q^{(2)}_{i_0} = 1$. □
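
The threshold of Theorem 1 can be read as a direct comparison of two expected losses: ignoring a prediction at offset $t$ costs $p(t + D + R)$ on average, while trusting it costs $p(C_p + D + R) + (1 - p)C_p$. The sketch below encodes both quantities and the resulting rule; the function names and sample values are illustrative assumptions.

```python
def expected_loss_if_ignored(t, p, D, R):
    """Expected time lost when a prediction at offset t in the period is ignored."""
    return p * (t + D + R)


def expected_loss_if_trusted(p, C_p, D, R):
    """Expected time lost when a proactive checkpoint of length C_p is taken."""
    return p * (C_p + D + R) + (1 - p) * C_p


def should_take_proactive_checkpoint(t, p, C_p):
    """Theorem 1: act on the prediction if and only if t >= beta_lim = C_p / p."""
    return t >= C_p / p


# Hypothetical predictor with precision 0.8 and proactive checkpoints of 600 s:
# the threshold is beta_lim = 600 / 0.8 = 750 s.
for t in (700.0, 800.0):
    print(t,
          should_take_proactive_checkpoint(t, 0.8, 600.0),
          expected_loss_if_ignored(t, 0.8, 60.0, 600.0),
          expected_loss_if_trusted(0.8, 600.0, 60.0, 600.0))
```

The downtime and recovery terms cancel in the comparison, which is why the threshold depends only on $C_p$ and $p$.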



Let us now compute the value of the waste with the optimal algorithm. There are two cases, depending
upon whether $T < \beta_{lim}$ or not. For values of $T$ smaller than $\beta_{lim}$, Theorem 1 shows that the optimal
algorithm never takes any proactive action; in that case the waste is given by Equation (12) in Section 3.
For values of $T$ larger than $\beta_{lim} = \frac{C_p}{p}$, we compute the waste due to predictions as

$\frac{1}{\mu_P} \left[ \frac{1}{T} \int_0^{C_p/p} p(t + D + R) dt + \frac{1}{T} \int_{C_p/p}^{T} \left(p(C_p + D + R) + (1 - p) C_p\right) dt \right] = \frac{r}{p\mu} \left[ p(D + R) + C_p - \frac{C_p^2}{2pT} \right]$

Indeed, in accordance with Theorem 1, no prediction is taken into account in the interval $[0, \frac{C_p}{p}]$, while
all predictions are taken into account in the interval $[\frac{C_p}{p}, T]$. Adding the waste due to unpredicted faults,
namely $\frac{1}{\mu_{NP}} \left[\frac{T}{2} + D + R\right]$, we derive

$WASTE_{fault} = \frac{1}{\mu} \left( (1 - r) \frac{T}{2} + \frac{r}{p} C_p \left(1 - \frac{1}{2p} \frac{C_p}{T}\right) + D + R \right)$
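Plugging this value into Equation (11), we obtain the total waste when $\frac{C_p}{p} \le T$:

$WASTE = \frac{C}{T} + \left(1 - \frac{C}{T}\right) \frac{1}{\mu} \left( (1 - r) \frac{T}{2} + \frac{r}{p} C_p \left(1 - \frac{C_p}{2pT}\right) + D + R \right)$

Altogether, the expression for the total waste becomes:

$WASTE_1(T) = \frac{C\left(1 - \frac{D + R}{\mu}\right)}{T} + \frac{D + R - C/2}{\mu} + \frac{T}{2\mu}$ if $T \le \frac{C_p}{p}$,

$WASTE_2(T) = \frac{rCC_p^2}{2\mu p^2} \frac{1}{T^2} + \frac{C\left(1 - \frac{\frac{r}{p}C_p + D + R}{\mu}\right) - \frac{rC_p^2}{2\mu p^2}}{T} + \frac{\frac{r}{p}C_p + D + R - (1 - r)\frac{C}{2}}{\mu} + \frac{(1 - r)T}{2\mu}$ if $T \ge \frac{C_p}{p}$. (15)

One can check that when $r = 0$ (no error predicted, hence no proactive action in the algorithm), then
$WASTE_1$ and $WASTE_2$ coincide. We also check that both values coincide for $T = \frac{C_p}{p}$. We show how to
minimize the waste in Equation (15) in Section 4.3.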



4.3 Waste minimization 



In this section we focus on minimizing the waste in Equation (15 1. Recall that, by construction, we 



always have to enforce the constraint T > C. First consider the case where C < On the interval 

T e [C, ^], we retrieve the optimal value found in Section jsj and derive that WastEi, the waste when 
predictions are not taken into account, is minimized for 

TnoPred = max ^C, min ^Trfo, ^ (16) 

Indeed, the optimal value should belong to the interval [C, and the function WastEi is convex: if 

the extremal solution ■y/2(/i — [D + R))C does not belong to this interval, then the optimal value is one 
of the bounds of the interval. 

On the interval $T \in [C_p/p, +\infty)$, we find the optimal solution by differentiating $\text{WASTE}_2$ twice with respect to $T$. Writing $\text{WASTE}_2(T) = \frac{u}{T^2} + \frac{v}{T} + w + xT$ for simplicity, we obtain $\text{WASTE}_2''(T) = \frac{2}{T^3}\left(\frac{3u}{T} + v\right)$. Here, a key parameter is the sign of $v$:

\[
v = C\left(1 - \frac{\frac{r}{p}C_p + D + R}{\mu}\right) - \frac{r\,C_p^2}{2\mu p^2}.
\]

We detail the case $v \ge 0$ in the following, because it is the most frequent with realistic parameter sets; we do have $v \ge 0$ for the whole range of simulations in Section 5. For the sake of completeness, we will briefly discuss the case $v < 0$ in the comments below.

When $v \ge 0$, we have $\text{WASTE}_2''(T) \ge 0$, so that $\text{WASTE}_2$ is convex on the interval $[C_p/p, +\infty)$ and admits a unique minimum $T_{\text{extr}}$. Note that $T_{\text{extr}}$ can be computed either numerically or using Cardano's method, since it is the unique positive real root of the degree-3 polynomial $xT^3 - vT - 2u$ obtained by setting $\text{WASTE}_2'(T) = 0$. The optimal solution on $[C_p/p, +\infty)$ is then $T_{\text{Pred}} = \max\left(T_{\text{extr}}, \frac{C_p}{p}\right)$.

It remains to consider the case where $C_p/p < C$. In fact, it suffices to add the constraint that the value of $T_{\text{Pred}}$ should be greater than $C$, that is:

\[
T_{\text{Pred}} = \max\left(C, \max\left(T_{\text{extr}}, \frac{C_p}{p}\right)\right). \tag{17}
\]
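Finally, the optimal solution for the waste is given by the minimum of the following two values:

\[
\text{WASTE}_1(T_{\text{NoPred}}) = \frac{C\left(1 - \frac{D+R}{\mu}\right)}{T_{\text{NoPred}}} + \frac{D + R - C/2}{\mu} + \frac{T_{\text{NoPred}}}{2\mu},
\]

\[
\text{WASTE}_2(T_{\text{Pred}}) = \frac{r\,C\,C_p^2}{2\mu p^2\,T_{\text{Pred}}^2} + \frac{C\left(1 - \frac{\frac{r}{p}C_p + D + R}{\mu}\right) - \frac{r\,C_p^2}{2\mu p^2}}{T_{\text{Pred}}} + \frac{-(1-r)\frac{C}{2} + \frac{r}{p}C_p + D + R}{\mu} + \frac{1-r}{2\mu}\,T_{\text{Pred}}.
\]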
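
The selection rule of Equations (16) and (17) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the coefficient names `u`, `v`, `w`, `x` follow the decomposition $\text{WASTE}_2(T) = u/T^2 + v/T + w + xT$, the cubic $xT^3 - vT - 2u = 0$ is solved by plain bisection instead of Cardano's method, only the case $v \ge 0$ is handled, and the parameter values are examples.

```python
import math


def coefficients(mu, C, Cp, D, R, p, r):
    """Coefficients of WASTE_2(T) = u/T^2 + v/T + w + x*T."""
    u = r * C * Cp ** 2 / (2 * mu * p ** 2)
    v = C * (1 - (r * Cp / p + D + R) / mu) - r * Cp ** 2 / (2 * mu * p ** 2)
    w = (r * Cp / p + D + R - (1 - r) * C / 2) / mu
    x = (1 - r) / (2 * mu)
    return u, v, w, x


def t_extr(u, v, x):
    """Positive root of x*T^3 - v*T - 2*u = 0, found by bisection.

    Assumes u >= 0, v >= 0 and x > 0 (that is, r < 1)."""
    f = lambda T: x * T ** 3 - v * T - 2 * u
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:          # grow the bracket until the sign changes
        hi *= 2
    for _ in range(200):       # far more halvings than double precision needs
        mid = (lo + hi) / 2
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def optimal_period(mu, C, Cp, D, R, p, r):
    """Return (waste, period, uses_predictions) following Equations (16)-(17)."""
    u, v, w, x = coefficients(mu, C, Cp, D, R, p, r)
    if v < 0:
        raise ValueError("this sketch only covers the case v >= 0")
    waste2 = lambda T: u / T ** 2 + v / T + w + x * T
    waste1 = lambda T: C * (1 - (D + R) / mu) / T + (D + R - C / 2) / mu + T / (2 * mu)
    t_pred = max(C, t_extr(u, v, x), Cp / p)
    candidates = [(waste2(t_pred), t_pred, True)]
    if C <= Cp / p:            # the no-prediction regime [C, Cp/p] is non-empty
        t_rfo = math.sqrt(2 * (mu - (D + R)) * C)
        t_nopred = max(C, min(t_rfo, Cp / p))
        candidates.append((waste1(t_nopred), t_nopred, False))
    return min(candidates)     # tuples compare on the waste first


waste, period, uses_predictions = optimal_period(1000.0, 10.0, 10.0, 1.0, 10.0, 0.82, 0.85)
print(f"waste={waste:.4f}, period={period:.1f} min, predictions used: {uses_predictions}")
```

With $r = 0$ the cubic degenerates and `t_extr` returns $\sqrt{2(\mu - (D+R))C}$, so the sketch falls back to the period of RFO, as expected.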



We make a few observations: 



• Just as for Equation (13) in Section 3, mathematical rigor calls for capping the values of $D$, $R$, $C$, $C_p$ and $T$ in front of the MTBF. The only difference is that we should replace $\mu$ by $\mu_e$; this is to account for the occurrence rate of all events, be they unpredicted faults or predictions.

• While the expression of the waste looks complicated, the numerical value of the optimal period can easily be computed in all cases. We have dealt with the case $v \ge 0$, where $v$ is the coefficient of $1/T$ in $\text{WASTE}_2(T) = \frac{u}{T^2} + \frac{v}{T} + w + xT$. When $v < 0$, we only need to compute all the nonnegative real roots of a polynomial of degree 3, and check which one leads to the best value. More precisely, these root(s) partition the admissible interval $[\max(C, C_p/p), +\infty)$ into several sub-intervals, and the optimal value is either a root or a sub-interval bound.

• In many practical situations, when $\mu$ is large enough, we can dramatically simplify the expression of $\text{WASTE}_2(T)$: we have $T = O(\sqrt{\mu})$, the term $\frac{u}{T^2}$ becomes negligible, checkpoint parameters become negligible in front of $\mu$, and we derive the approximated value $\sqrt{\frac{2\mu C}{1-r}}$. This value can be seen as an extension of Equation (13) giving $T_{\text{RFO}}$, where $\mu$ is replaced by $\frac{\mu}{1-r}$: faults are replaced by non-predicted faults, and the overhead due to false predictions is negligible. As a word of caution, recall that this conclusion is valid only when $\mu$ is very large in front of all other parameters.
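
The approximation in the last observation can be recovered in two lines. As a sketch, assume that $\mu$ is large enough for the $1/T^2$ term to be dropped and for the coefficient of $1/T$ to reduce to $C$, so that only $C/T$ and $(1-r)T/(2\mu)$ depend on $T$:

```latex
\begin{align*}
\frac{d}{dT}\left(\frac{C}{T} + \frac{(1-r)\,T}{2\mu}\right)
  = -\frac{C}{T^2} + \frac{1-r}{2\mu} = 0
\quad\Longrightarrow\quad
T \approx \sqrt{\frac{2\mu C}{1-r}} = \sqrt{2\,\frac{\mu}{1-r}\,C}.
\end{align*}
```

For instance, with $\mu = 1{,}000$ min, $C = 10$ min and $r = 0.85$, this gives a period of about 365 min.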



5 Simulation results 

We start by presenting the simulation framework (Section 5.1). Then we report results using the characteristics of two fault predictors from the literature (Section 5.2). Finally, we assess the respective impact of the two key parameters of the predictor, its recall and its precision, on checkpointing strategies (Section 5.3).



5.1 Simulation framework 

In order to validate our model, we have instantiated it with several scenarios. The experiments use parameters that are representative of current and forthcoming large-scale platforms [151 US]. We take $C = R = 10$ min, and $D = 1$ min. For the proactive checkpoints we consider three scenarios where proactive checkpoints are (i) exactly as expensive as periodic ones ($C_p = C$), (ii) ten times cheaper ($C_p = 0.1C$), and (iii) two times more expensive ($C_p = 2C$). The individual (processor) MTBF is $\mu_{\text{ind}} = 125$ years, and the total number of processors $N$ varies from $N = 16{,}384$ to $N = 524{,}288$, so that the platform MTBF $\mu$ varies from $\mu = 4{,}010$ min (about 2.8 days) down to $\mu = 125$ min (about 2 hours). For instance the Jaguar platform, with $N = 45{,}208$ processors, is reported to experience about one fault per day [1], which leads to $\mu_{\text{ind}} = \frac{45{,}208}{365} \approx 125$ years. The application size is set to $\text{TIME}_{\text{base}} = 10{,}000$ years$/N$.
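
The platform MTBF values quoted above follow directly from $\mu = \mu_{\text{ind}}/N$. The short sketch below reproduces the arithmetic; the helper name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def platform_mtbf_minutes(mu_ind_years, n_procs):
    """Platform MTBF mu = mu_ind / N, converted to minutes."""
    return mu_ind_years * MINUTES_PER_YEAR / n_procs


for exponent in range(14, 20):
    n = 2 ** exponent
    print(f"N = 2^{exponent} = {n:>7,d}  ->  mu = {platform_mtbf_minutes(125, n):8.1f} min")
```

The two extreme platform sizes give roughly 4,010 min and 125 min, matching the range stated in the text.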

We used Maple to analytically compute and plot the optimal value of the waste for both the algorithm taking predictions into account, OptimalPrediction, and for the algorithm ignoring them, RFO. In order to check the accuracy of our model, we have compared these results with results obtained with a discrete-event simulator. The simulation engine generates a random trace of faults, parameterized either by an Exponential fault distribution or by Weibull distribution laws with shape parameter 0.5 or 0.7. Note that Exponential faults are widely used for theoretical studies, while Weibull faults are representative of the behavior of real-world platforms [501 IHl 122]. In both cases, the distribution is scaled so that its expectation corresponds to the platform MTBF $\mu$. Each fault is then marked as predicted with probability $r$, and as unpredicted otherwise. The simulation engine also generates a random trace of false predictions, whose distribution is identical to that of the first trace (in Appendix B we also consider the case where false predictions are generated according to a uniform distribution; results are quite similar). This second distribution is scaled so that its expectation is equal to $\frac{\mu_P}{1-p} = \frac{p\mu}{r(1-p)}$, the inter-arrival time of false predictions. Finally, both traces are merged to produce the final trace including all events (true predictions, false predictions, and non-predicted faults). Each reported value is the average over 100 randomly generated instances.
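
The trace generation described above can be sketched as follows. This is an illustrative reconstruction, not the simulator used for the experiments: the function names, event labels and the fixed time horizon are assumptions, and the Weibull scale is chosen as $\text{mean}/\Gamma(1 + 1/k)$ so that the distribution has the requested expectation.

```python
import math
import random


def scaled_sampler(mean, shape, rng):
    """Sampler of inter-arrival times with the requested mean.

    shape=None gives an Exponential law; otherwise a Weibull law of that shape."""
    if shape is None:
        return lambda: rng.expovariate(1.0 / mean)
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return lambda: rng.weibullvariate(scale, shape)


def arrival_times(sampler, horizon):
    """Cumulative arrival dates up to the time horizon."""
    t, dates = 0.0, []
    while True:
        t += sampler()
        if t > horizon:
            return dates
        dates.append(t)


def generate_trace(mu, p, r, horizon, shape=None, seed=0):
    """Merged trace of true predictions, unpredicted faults and false predictions."""
    rng = random.Random(seed)
    faults = arrival_times(scaled_sampler(mu, shape, rng), horizon)
    events = [(t, "true_prediction" if rng.random() < r else "unpredicted_fault")
              for t in faults]
    mean_false = p * mu / (r * (1 - p))   # inter-arrival time of false predictions
    events += [(t, "false_prediction")
               for t in arrival_times(scaled_sampler(mean_false, shape, rng), horizon)]
    events.sort()
    return events


trace = generate_trace(mu=100.0, p=0.82, r=0.85, horizon=1_000_000.0)
kinds = [kind for _, kind in trace]
tp = kinds.count("true_prediction")
fp = kinds.count("false_prediction")
uf = kinds.count("unpredicted_fault")
print(f"empirical recall = {tp / (tp + uf):.3f}, empirical precision = {tp / (tp + fp):.3f}")
```

On a long enough trace the empirical recall and precision come out close to the requested $r$ and $p$, which is a quick sanity check that the scaling of the false-prediction trace is right.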
In the simulations, we compare four checkpointing strategies: 

• RFO is the checkpointing strategy of period $T = \sqrt{2(\mu - (D + R))C}$ (see Section 3).

• OptimalPrediction is the refined algorithm described in Section 4.2.

• To assess the quality of each strategy, we compare it with its BestPeriod counterpart, defined as 
the same strategy but using the best possible period T. This latter period is computed via a brute- 
force numerical search for the optimal period (each tested period is evaluated on 100 randomly 
generated traces, and the period achieving the best average performance is elected as the "best 
period"). 

5.2 Predictors from the literature 

We first experiment with two predictors from the literature: one accurate predictor with high recall and precision [7], namely with $p = 0.82$ and $r = 0.85$, and another predictor with intermediate recall and precision [H], namely with $p = 0.4$ and $r = 0.7$. Figures 3 and 4 show the average waste degradation for the two checkpointing policies, and for their BestPeriod counterparts, for both predictors. The waste is reported as a function of the number of processors $N$. We draw the plots as a function of the number of processors $N$ rather than of the platform MTBF $\mu = \mu_{\text{ind}}/N$, because it is more natural to see the waste increase with larger platforms. However, recall that this work is agnostic of the granularity of the processors and intrinsically focuses on the impact of the MTBF on the waste.

The first observation is the very good correspondence between analytical results and simulations in Figures 3 and 4. In particular, the Maple plots and the simulations for Exponentially distributed faults are very similar. This shows the validity of the model and of its analysis. Another striking result is that OptimalPrediction has the same waste as its BestPeriod counterpart even for Weibull fault distributions, which shows that our period $T_{\text{Pred}}$ performs as well as the best period found by brute-force search. These conclusions are valid regardless of the cost ratio of periodic and proactive checkpoints.

The second observation is that the prediction is useful for the vast majority of the set of parameters under study. When proactive checkpoints are cheaper than periodic ones, the benefits of fault prediction are increased. On the contrary, when proactive checkpoints are more expensive than periodic ones, the benefits of fault prediction are greatly reduced. One can even observe that the waste with prediction is not better than without prediction in the following scenario: $C_p = 2C$, and using the limited-quality predictor ($p = 0.4$, $r = 0.7$) with $2^{19}$ processors, see Figures 4(i), (j), (k) and (l).

In order to compare the heuristics without prediction to those with prediction, we report job execution times when the fault distribution follows either an Exponential distribution law (Table 2), or a Weibull distribution law (Table 3 for shape parameter $k = 0.7$ and Table 4 for $k = 0.5$).

We compute the gain (expressed in percentage) achieved by OptimalPrediction over RFO. We also add in these tables the execution times obtained when using the expression of $T$ given by Young [9] and Daly [10] (denoted respectively as Young and Daly) to assess whether $T_{\text{RFO}}$ is a better approximation. Recall that these three approaches do not use any predictor, which explains why the numbers are identical on both sides of the tables.

As a general trend, we observe that the gains due to predictions are more important when the distribution law is further apart from an Exponential distribution. Indeed, the largest gains are obtained when the fault distribution follows a Weibull law of shape parameter 0.5. Using OptimalPrediction in conjunction with a "good" fault predictor gains up to 66% when there is a large number of processors ($2^{19}$). The gain is still 37% with $2^{16}$ processors. Using a predictor with limited recall and precision, OptimalPrediction can still decrease the execution time by 47% with $2^{19}$ processors, and 31% with $2^{16}$ processors. In all tested cases, the decrease of the execution times is significant. Gains are less important with Weibull laws of shape parameter $k = 0.7$, however they still reach a minimum of 13% with $2^{16}$ processors,
[Figure 3 panels, one row per proactive checkpoint cost: (a), (e), (i) Maple; (b), (f), (j) Exponential; (c), (g), (k) Weibull k = 0.7; (d), (h), (l) Weibull k = 0.5. Curves: RFO, BestPeriod RFO, OptimalPrediction, BestPeriod OptimalPrediction.]

Figure 3: Waste (y-axis) for the different heuristics as a function of the platform size (x-axis), with
p = 0.82, r = 0.85, Cp = C (first row), Cp = 0.1C (second row), or Cp = 2C (third row) and with a trace
of false predictions parametrized by a distribution identical to the distribution of the trace of failures.
[Figure 4 panels, one row per proactive checkpoint cost: (a), (e), (i) Maple; (b), (f), (j) Exponential; (c), (g), (k) Weibull k = 0.7; (d), (h), (l) Weibull k = 0.5.]

Figure 4: Waste (y-axis) for the different heuristics as a function of the platform size (x-axis), with
p = 0.4, r = 0.7, Cp = C (first row), Cp = 0.1C (second row), or Cp = 2C (third row) and with a trace
of false predictions parametrized by a distribution identical to the distribution of the trace of failures.
and up to 38% with $2^{19}$ processors. Finally, gains are further reduced with an Exponential law. They still reach at least 5% with $2^{16}$ processors, and up to 19% with $2^{19}$ processors.

Coming back to the case without fault prediction, it is striking to observe in Table 4 (Weibull law with $k = 0.5$) that job execution time increases together with the number of processors (from $N = 2^{16}$ to $N = 2^{19}$) if the checkpointing period is that of Daly or Young. On the contrary, job execution time (rightfully) decreases when using RFO. In Table 3 ($k = 0.7$), execution times decrease with all three periods, but RFO is again the fastest. However, the periods given by Young, Daly and RFO lead to nearly identical execution times for Exponential distributions (Table 2). This confirms the analytical evaluation of Table 1 in Section 3. Altogether, the main (striking) conclusion is that RFO should be preferred to both classical approaches for Weibull distributions.





Cp = C               (p = 0.82, r = 0.85)           (p = 0.4, r = 0.7)
                     Execution time (in days)       Execution time (in days)
                     2^16 procs    2^19 procs       2^16 procs    2^19 procs
Young                65.2          11.7             65.2          11.7
Daly                 65.2          11.8             65.2          11.8
RFO                  65.2          11.7             65.2          11.7
OptimalPrediction    60.0 (8%)     9.5 (19%)        61.7 (5%)     10.7 (8%)

Table 2: Job execution times for an Exponential distribution, and gains due to the fault predictor (with
respect to the performance of RFO).





Cp = C               (p = 0.82, r = 0.85)           (p = 0.4, r = 0.7)
                     Execution time (in days)       Execution time (in days)
                     2^16 procs    2^19 procs       2^16 procs    2^19 procs
Young                81.3          30.1             81.3          30.1
Daly                 81.4          31.0             81.4          31.0
RFO                  80.3          25.5             80.3          25.5
OptimalPrediction    65.9 (18%)    15.9 (38%)       69.7 (13%)    20.2 (21%)

Table 3: Job execution times for a Weibull distribution with shape parameter k = 0.7, and gains due to
the fault predictor (with respect to the performance of RFO).





Cp = C               (p = 0.82, r = 0.85)           (p = 0.4, r = 0.7)
                     Execution time (in days)       Execution time (in days)
                     2^16 procs    2^19 procs       2^16 procs    2^19 procs
Young                125.5         171.8            125.5         171.8
Daly                 125.8         184.7            125.8         184.7
RFO                  120.2         114.8            120.2         114.8
OptimalPrediction    75.9 (37%)    39.5 (66%)       83.0 (31%)    60.8 (47%)

Table 4: Job execution times for a Weibull distribution with shape parameter k = 0.5, and gains due to
the fault predictor (with respect to the performance of RFO).
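
The percentages reported in the tables are relative reductions of the execution time with respect to RFO. As a small worked example (the helper name and the dictionary layout are illustrative), the gains of Table 4 can be recomputed from the execution times:

```python
def gain_percent(time_rfo, time_pred):
    """Relative reduction of execution time, in percent, with respect to RFO."""
    return 100.0 * (time_rfo - time_pred) / time_rfo


# (RFO days, OptimalPrediction days) read from Table 4 (Weibull, k = 0.5).
table4 = {
    ("p=0.82, r=0.85", "2^16 procs"): (120.2, 75.9),
    ("p=0.82, r=0.85", "2^19 procs"): (114.8, 39.5),
    ("p=0.4, r=0.7", "2^16 procs"): (120.2, 83.0),
    ("p=0.4, r=0.7", "2^19 procs"): (114.8, 60.8),
}

for (predictor, size), (rfo, pred) in table4.items():
    print(f"{predictor}, {size}: {gain_percent(rfo, pred):.0f}%")
```

Rounded to the nearest integer, these values agree with the gains printed in the table.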



5.3 Recall vs. precision 

In this section, we assess the impact of the two key parameters of the predictor, its recall $r$ and its precision $p$. To this purpose, we conduct simulations where one parameter is fixed while the other varies. We choose two platforms, a smaller one with $N = 2^{16}$ processors (or an MTBF $\mu = 1{,}000$ min) and a larger one with $N = 2^{19}$ processors (or an MTBF $\mu = 125$ min). In both cases we study the impact of the predictor characteristics assuming a Weibull fault distribution with shape parameter 0.5 or 0.7.

In Figures 5 and 6 we fix the value of $r$ (either $r = 0.4$ or $r = 0.8$) and we let $p$ vary from 0.3 to 0.99. In the four plots, we observe that the precision has a minor impact on the waste, whether it is with a Weibull distribution of shape parameter 0.7 (Figure 5), or a Weibull distribution of shape parameter 0.5
[Figure 5 panels: (a) r = 0.4, N = 2^16; (b) r = 0.4, N = 2^19; (c) r = 0.8, N = 2^16; (d) r = 0.8, N = 2^19. The x-axis spans precision values from 0.3 to 0.99.]

Figure 5: Waste (y-axis) as a function of the precision (x-axis) for a fixed recall (r = 0.4 and r = 0.8)
and for a Weibull distribution of faults (with shape parameter k = 0.7).

[Figure 6 panels: (a) r = 0.4, N = 2^16; (b) r = 0.4, N = 2^19; (c) r = 0.8, N = 2^16; (d) r = 0.8, N = 2^19. The x-axis spans precision values from 0.3 to 0.99.]

Figure 6: Waste (y-axis) as a function of the precision (x-axis) for a fixed recall (r = 0.4 and r = 0.8)
and for a Weibull distribution of faults (with shape parameter k = 0.5).



(Figure 6). In Figures 7 and 8 we conduct the converse experiment and fix the value of $p$ (either $p = 0.4$ or $p = 0.8$), letting $r$ vary from 0.3 to 0.99. Here we observe that increasing the recall significantly improves performance.

Altogether we conclude that it is more important (for the design of future predictors) to focus on improving the recall $r$ rather than the precision $p$, and our results can help quantify this statement. We provide an intuitive explanation as follows: unpredicted faults prove very harmful and heavily increase the waste, while unduly checkpointing due to false predictions turns out to induce a smaller overhead.
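
The same trend can be observed on the analytical waste itself. The sketch below is only an illustration under stated assumptions: it uses the first-order waste formula (not the Weibull simulations reported in the figures), a brute-force grid search over the period in the spirit of BestPeriod, and example parameters $C = C_p = R = 10$ min, $D = 1$ min, $\mu = 1{,}000$ min. Function names are hypothetical.

```python
import math


def waste(T, mu, C, Cp, D, R, p, r):
    """First-order waste: predictions are ignored before Cp/p, trusted after."""
    if T <= Cp / p:
        fault = (T / 2 + D + R) / mu
    else:
        fault = ((1 - r) * T / 2 + (r / p) * Cp * (1 - Cp / (2 * p * T)) + D + R) / mu
    return C / T + (1 - C / T) * fault


def best_waste(mu, C, Cp, D, R, p, r, steps=20000):
    """Smallest waste over a regular grid of periods T >= C (brute-force search)."""
    t_max = 20 * math.sqrt(2 * mu * C)   # generous upper bound for the grid
    grid = (C + i * (t_max - C) / steps for i in range(steps + 1))
    return min(waste(T, mu, C, Cp, D, R, p, r) for T in grid)


params = dict(mu=1000.0, C=10.0, Cp=10.0, D=1.0, R=10.0)
base = best_waste(p=0.4, r=0.4, **params)
better_precision = best_waste(p=0.8, r=0.4, **params)
better_recall = best_waste(p=0.4, r=0.8, **params)

print(f"p=0.4, r=0.4: {base:.4f}")
print(f"p=0.8, r=0.4: {better_precision:.4f}")
print(f"p=0.4, r=0.8: {better_recall:.4f}")
```

In this setting, doubling the recall reduces the waste far more than doubling the precision, which is consistent with the conclusion drawn from the simulations.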




[Figure 7 panels: (a) p = 0.4, N = 2^16; (b) p = 0.4, N = 2^19; (c) p = 0.8, N = 2^16; (d) p = 0.8, N = 2^19. The x-axis spans recall values from 0.3 to 0.99.]

Figure 7: Waste (y-axis) as a function of the recall (x-axis) for a fixed precision (p = 0.4 and p = 0.8)
and for a Weibull distribution (k = 0.7).

[Figure 8 panels: (a) p = 0.4, N = 2^16; (b) p = 0.4, N = 2^19; (c) p = 0.8, N = 2^16; (d) p = 0.8, N = 2^19. The x-axis spans recall values from 0.3 to 0.99.]

Figure 8: Waste (y-axis) as a function of the recall (x-axis) for a fixed precision (p = 0.4 and p = 0.8)
and for a Weibull distribution (k = 0.5).
Lead time    Precision    Recall
300 s        40 %         70 %
600 s        35 %         60 %
2 h          64.8 %       65.2 %
min          82.3 %       85.4 %
32 s         93 %         43 %
10 s         92 %         40 %
60 s         92 %         20 %
600 s        92 %         3 %
NA           70 %         75 %
NA           20 %         30 %
NA           30 %         75 %
NA           40 %         90 %
NA           50 %         30 %
NA           60 %         85 %

Table 5: Comparative study of different parameters returned by some predictors.

6 Related work 

Considerable research has been devoted to fault prediction, using very different models (system log analysis [7], event-driven approach [11[71[S], support vector machines [6, 3], nearest neighbors [6], etc.). In this section we give a brief overview of existing predictors, focusing on their characteristics rather than on the methods of prediction. For the sake of clarity, we sum up the characteristics of the different fault predictors encountered in Table 5.

The authors of [8] introduce the lead time, that is, the duration between the time the prediction is made and the time the predicted fault is supposed to happen. This time should be sufficiently large to enable proactive actions. As already mentioned, the distribution of lead times is irrelevant. Indeed, only predictions whose lead time is greater than $C_p$, the time to take a proactive checkpoint, are meaningful. Predictions whose lead time is smaller than $C_p$, whenever they materialize as actual faults, should be classified as unpredicted faults; the predictor recall should be decreased accordingly.
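
The recall adjustment described above amounts to counting only the true predictions that leave enough time for a proactive checkpoint. A minimal sketch, with a hypothetical helper name and made-up lead times:

```python
def effective_recall(true_prediction_lead_times, n_faults, Cp):
    """Recall after reclassifying as unpredicted every correctly predicted
    fault whose lead time does not exceed the proactive checkpoint time Cp."""
    usable = sum(1 for lead in true_prediction_lead_times if lead > Cp)
    return usable / n_faults


# Made-up example: 10 faults, 4 of them predicted, with lead times in seconds.
lead_times = [5.0, 20.0, 30.0, 8.0]
print(effective_recall(lead_times, n_faults=10, Cp=10.0))  # raw recall would be 0.4
```

Only two of the four predictions arrive more than 10 seconds ahead, so the recall to plug into the model drops from 0.4 to 0.2 in this example.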

The predictor of [8] is also able to locate where the predicted fault is supposed to strike. This additional characteristic has a negative impact on the precision (because a fault happening at the predicted time but not on the predicted location is classified as a non-predicted fault; see the low value of $p$ in Table 5). The authors of [8] state that fault localization has a positive impact on proactive checkpointing time in their context: instead of a full checkpoint costing 1,500 seconds they can take a partial checkpoint costing only 12 seconds. This led us to introduce a different cost $C_p$ for proactive checkpoints, which can be smaller than the cost $C$ of regular checkpoints. Gainaru et al. [5] also stated that fault localization could help decrease the checkpointing time. Their predictor also gives information on fault localization. They studied the impact of different lead times on the recall of their predictor. Papers [7] and [B] also considered lead times.
Most studies on fault prediction state that a proactive action must be taken right before the predicted fault, be it a checkpoint or a migration. However, we have shown in this paper that it is beneficial to ignore some predictions, namely when the predicted fault is announced to happen less than $\frac{C_p}{p}$ seconds after the last periodic checkpoint.

Gainaru et al. [5] studied the impact of prediction on the checkpointing period. Their computation of the total waste is not fully accurate and they do not provide any minimization analysis. Instead, they only propose to use Young's formula, replacing the MTBF by the mean time between unpredicted faults. They do not question whether all predictions should be taken into account.

Li et al. [23] considered the mathematical problem of when and how to migrate. In order to be able to use migration, they assumed that at any time 2% of the resources are available as spares. This allows them to design a Knapsack-based heuristic. Thanks to their algorithm, they were able to save 30% of the execution time compared to a heuristic that does not take the prediction into account, with a precision and recall of 70%, and with a maximum load of 0.7. In our study we do not assume that we have a batch of spare resources. We assume that after a downtime the resources that failed are once again available.

Note that some authors [ZJE] do not consider that their predictors predict the exact time of the fault. On the contrary, they consider a "prediction window", which is the time interval in which the predicted fault is supposed to occur. Because most papers focus on prediction windows of negligible length, we did not consider prediction windows in this study.

Finally, to the best of our knowledge, this work is the first to focus on the mathematical aspect of 
fault prediction, and to provide a model and a detailed analysis of the waste due to all three types of 
events (true and false predictions and unpredicted failures). 



7 Conclusion 

In this work we have studied the impact of fault prediction on periodic checkpointing. We started by revisiting the first-order approach by Young and Daly. We have performed a refined analysis leading to a better checkpointing period: $T_{\text{RFO}}$ is slightly closer to the optimal period for Exponential distributions (the only case where the optimal is known), and leads to smaller execution times for Weibull distributions (as shown in Section 5.2).



Then we have extended the analysis to include fault predictions. We have established analytical conditions stating whether a fault prediction should be taken into account or not. More importantly, we have proven that the optimal approach is to never trust the predictor in the beginning of a regular period, and to always trust it in the end of the period; the cross-over point $\frac{C_p}{p}$ depends on the time to take a proactive checkpoint and on the precision of the predictor. This striking result is somewhat unexpected, as one might have envisioned more trust regimes, with several intermediate trust levels smoothly evolving from a "never trust" policy to an "always trust" one.

Through an extensive set of simulations involving faults following either an Exponential distribution 
law or a Weibull one, we have established the accuracy of the model, of its analysis, and of the predicted 
period (in the presence of a fault predictor). These simulations also show that even a not-so-good 
fault predictor can lead to quite a significant decrease in the application execution time. We have also 
shown that the most important characteristic of a fault predictor is its recall (the percentage of actually 
predicted faults) rather than its precision (the percentage of predictions that actually correspond to 
faults): better safe than sorry, or better prepare for a false event than miss an actual failure!

Altogether, the analytical model and the comprehensive results provided in this work make it possible to fully assess the impact of fault prediction on optimal checkpointing strategies. Future work will be devoted to refining the assessment of the usefulness of prediction with trace-based failure and prediction logs from current large-scale supercomputers.
For the sake of completeness, we provide a proof of the following result:

Proposition 2. Consider a platform comprising $N$ components, and assume that the inter-arrival times of the faults on the components are independent and identically distributed random variables that follow an arbitrary probability law whose expectation is $\mu_{\text{ind}}$. Then the expectation of the inter-arrival times of the faults on the whole platform is $\mu = \frac{\mu_{\text{ind}}}{N}$.

Proof. Consider first a single component, say component number $q$. Let $X_i$, $i > 0$ denote the IID random variables for fault inter-arrival times on that component, with $E(X_i) = \mu_{\text{ind}}$. Consider a fixed time bound $F$. Let $n_q(F)$ be the number of faults on the component until time $F$ is exceeded. In other words, the $(n_q(F) - 1)$-th fault is the last one to happen strictly before time $F$, and the $n_q(F)$-th fault is the first to happen at time $F$ or after. By definition of $n_q(F)$, we have

\[
\sum_{i=1}^{n_q(F)-1} X_i < F \le \sum_{i=1}^{n_q(F)} X_i.
\]

Using Wald's equation [24, p. 486], with $n_q(F)$ as a stopping criterion, we derive:

\[
\left(E(n_q(F)) - 1\right)\mu_{\text{ind}} \le F \le E(n_q(F))\,\mu_{\text{ind}},
\]

and we obtain:

\[
\lim_{F \to +\infty} \frac{E(n_q(F))}{F} = \frac{1}{\mu_{\text{ind}}}. \tag{18}
\]

Consider now the whole platform, and let $Y_i$, $i > 0$ denote the IID random variables for fault inter-arrival times on the platform, with $E(Y_i) = \mu$. Consider a fixed time bound $F$ as before. Let $n(F)$ be the number of faults on the whole platform until time $F$ is exceeded. With the same reasoning for the whole platform as for a single component, we derive:

\[
\lim_{F \to +\infty} \frac{E(n(F))}{F} = \frac{1}{\mu}. \tag{19}
\]

Now let $m_q(F)$ be the number of these faults that strike component number $q$. Of course we have $n(F) = \sum_{q=1}^{N} m_q(F)$. By definition, except for the component hit by the last failure, $m_q(F) + 1$ is the number of failures on component $q$ until time $F$ is exceeded, hence $n_q(F) = m_q(F) + 1$ (and this number
[Figure 9 panels, one row per proactive checkpoint cost: (a), (e), (i) Maple; (b), (f), (j) Exponential; (c), (g), (k) Weibull k = 0.7; (d), (h), (l) Weibull k = 0.5.]

Figure 9: Waste (y-axis) for the different heuristics as a function of the platform size (x-axis), with
p = 0.82, r = 0.85, Cp = C (first row), Cp = 0.1C (second row), or Cp = 2C (third row) and with a
trace of false predictions parametrized by a uniform distribution.



is $m_q(F) = n_q(F)$ on the component hit by the last failure). From Equation (18) again, we have for each component $q$:

\[
\lim_{F \to +\infty} \frac{E(m_q(F))}{F} = \frac{1}{\mu_{\text{ind}}}.
\]

Since $n(F) = \sum_{q=1}^{N} m_q(F)$, we also have:

\[
\lim_{F \to +\infty} \frac{E(n(F))}{F} = \frac{N}{\mu_{\text{ind}}}. \tag{20}
\]

Equations (19) and (20) lead to the result. □
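
Proposition 2 can be illustrated with a small Monte Carlo experiment. The sketch below is not part of the proof: it superposes $N$ independent renewal processes with Weibull inter-arrival times (shape 0.7 chosen arbitrarily, scale set to $\mu_{\text{ind}}/\Gamma(1 + 1/k)$ so that the mean is $\mu_{\text{ind}}$) and estimates the mean time between consecutive faults on the whole platform. All names and values are illustrative.

```python
import math
import random


def superposed_mean_interarrival(n_components, mu_ind, horizon, shape=0.7, seed=0):
    """Empirical mean time between consecutive faults on the whole platform."""
    rng = random.Random(seed)
    scale = mu_ind / math.gamma(1.0 + 1.0 / shape)  # Weibull scale giving mean mu_ind
    dates = []
    for _ in range(n_components):
        t = rng.weibullvariate(scale, shape)
        while t <= horizon:
            dates.append(t)
            t += rng.weibullvariate(scale, shape)
    dates.sort()
    return (dates[-1] - dates[0]) / (len(dates) - 1)


estimate = superposed_mean_interarrival(n_components=50, mu_ind=100.0, horizon=20000.0)
print(f"estimated platform MTBF: {estimate:.3f} (Proposition 2 predicts {100.0 / 50})")
```

The estimate is close to $\mu_{\text{ind}}/N = 2$ even though the component law is far from Exponential, in line with the proposition.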



B 

In this section, we provide results when false predictions are generated according to a uniform distribution.
[Figure 10 panels, one row per proactive checkpoint cost: (a), (e), (i) Maple; (b), (f), (j) Exponential; (c), (g), (k) Weibull k = 0.7; (d), (h), (l) Weibull k = 0.5.]

Figure 10: Waste (y-axis) for the different heuristics as a function of the platform size (x-axis), with
p = 0.4, r = 0.7, Cp = C (first row), Cp = 0.1C (second row), or Cp = 2C (third row) and with a trace
of false predictions parametrized by a uniform distribution.